Run-Time Efficient RNN Compression for Inference on Edge Devices
Recurrent neural networks can be large and compute-intensive, yet many
applications that benefit from RNNs run on small devices with very limited
compute and storage capabilities while still having run-time constraints. As a
result, there is a need for compression techniques that can achieve significant
compression without negatively impacting inference run-time and task accuracy.
This paper explores a new compressed RNN cell implementation called Hybrid
Matrix Decomposition (HMD) that achieves this dual objective. This scheme
divides the weight matrix into two parts: an unconstrained upper half and a
lower half composed of rank-1 blocks. The resulting output feature vector has
a "richer" upper sub-vector and a "constrained" lower sub-vector. HMD can
compress RNNs by a factor of 2-4x while running faster than pruning (Zhu &
Gupta, 2017) and retaining more model accuracy than matrix factorization
(Grachev et al., 2017). We evaluate this
technique on 5 benchmarks spanning 3 different applications, illustrating its
generality in the domain of edge computing.

Comment: Published at the 4th edition of the Workshop on Energy Efficient
Machine Learning and Cognitive Computing for Embedded Applications, co-located
with the International Symposium on Computer Architecture (ISCA) 2019,
Phoenix, Arizona (https://www.emc2-workshop.com/isca-19)
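To make the structure concrete, below is a minimal NumPy sketch of the
matrix-vector product an HMD-style weight would perform. It assumes the lower
half is partitioned row-wise into blocks, each stored as a rank-1 outer-product
factor pair; the function name hmd_matvec and the exact block layout are
illustrative assumptions, not taken from the paper.

```python
import numpy as np

def hmd_matvec(W_upper, blocks, x):
    """Matrix-vector product with an HMD-style weight matrix (sketch).

    W_upper : (m/2, n) dense, unconstrained upper half.
    blocks  : list of (a, b) pairs; each rank-1 block of the lower half
              is stored as the outer product a b^T, with a of shape (p,)
              and b of shape (n,). NOTE: the paper's exact blocking of
              the lower half may differ from this row-wise assumption.
    x       : (n,) input vector.
    """
    # Upper sub-vector: ordinary dense GEMV, yielding the "richer" features.
    y_upper = W_upper @ x

    # Lower sub-vector: each rank-1 block costs O(p + n) multiplies instead
    # of O(p * n), since a * (b @ x) needs one dot product and one scaling.
    y_lower = np.concatenate([a * (b @ x) for a, b in blocks])

    return np.concatenate([y_upper, y_lower])

# Toy usage: m = 8 outputs, n = 6 inputs, lower half = 2 rank-1 blocks.
rng = np.random.default_rng(0)
W_upper = rng.standard_normal((4, 6))
blocks = [(rng.standard_normal(2), rng.standard_normal(6)) for _ in range(2)]
x = rng.standard_normal(6)
y = hmd_matvec(W_upper, blocks, x)  # shape (8,)
```

Note how the lower half stores only p + n numbers per block instead of p * n,
which is where both the compression and the run-time advantage over an
unstructured dense (or pruned) matrix come from in this sketch.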